Taking a quick view of the data

# For the 2019 data set
dim(data2019)
## [1] 39 17
head(data2019, 10)
## # A tibble: 10 × 17
##    ...1  ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11 ...12 ...13
##    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>   <NA> <NA>  <NA>   <NA>
##  2 Loca… Resi… <NA>  <NA>  <NA>  <NA>  <NA>  Solar <NA>   <NA> <NA>  <NA>  "Non…
##  3 <NA>  Dail… MWh   <NA>  <NA>  Cust… <NA>  Numb… <NA>  "Gen… <NA>  Ener… "MWh"
##  4 <NA>  <NA>  Gene… Off … Total Off … Total Res   Non-… "Res… Non-… <NA>   <NA>
##  5 BAYS… 12.4… 2756… 2947… 3051… 1552… 6697… 2842  190   "917… 3900… 5848… "136…
##  6 BURW… 12.8… 6442… 3814… 6824… 2241… 1455… 684   35    "222… 631.… 1434… "414…
##  7 CANA… 13.4… 1784… 8682… 1870… 5259… 3805… 1601  85    "588… 3572… 3427… "678…
##  8 CANT… 15.2… 6444… 8135… 7258… 3701… 1299… 9947  561   "320… 1613… 2038… "231…
##  9 CENT… 16.4… 7375… 1699… 9074… 8133… 1510… 21015 674   "754… 1918… 5525… "222…
## 10 CESS… 18.9… 1451… 2206… 1671… 1076… 2412… 4329  204   "194… 4744… 1192… "431…
## # … with 4 more variables: ...14 <chr>, ...15 <chr>, ...16 <chr>, ...17 <chr>
# For the 2020 data set
dim(data2020)
## [1] 41 17
head(data2020, 10)
## # A tibble: 10 × 17
##    ...1  ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11 ...12 ...13
##    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 Regi… Loca… Resi… <NA>  <NA>  <NA>  <NA>  <NA>  Solar <NA>   <NA> <NA>  <NA> 
##  2 <NA>  <NA>  Dail… MWh   <NA>  <NA>  Cust… <NA>  Numb… <NA>  "Gen… <NA>  Ener…
##  3 <NA>  <NA>  <NA>  Gene… Off … Total Off … Total Res   Non-… "Res… Non-… <NA> 
##  4 Sydn… BAYS… 12.4  2778… 28618 3064… 15314 67909 3134  212   "113… 4871  7584 
##  5 <NA>  BURW… 12.6  63676 3638  67314 2196  14600 753   38    "273… 674   1696 
##  6 <NA>  CANA… 13.3  1778… 8305  1861… 5075  38223 1805  99    "729… 3950  4406 
##  7 <NA>  CANT… 15.1  6416… 79006 7206… 36476 1310… 10900 631   "389… 17862 26313
##  8 <NA>  CUMB… 13.1  1019… 8697  1106… 4027  23149 1908  129   "619… 4167  4519 
##  9 <NA>  GEOR… 14.6  2699… 39761 3097… 19182 58079 4351  181   "159… 3850  9931 
## 10 <NA>  HORN… 18.2  3013… 49437 3507… 20861 52676 6789  180   "282… 4519  16246
## # … with 4 more variables: ...14 <chr>, ...15 <chr>, ...16 <chr>, ...17 <chr>
# For the 2023 data set
dim(data2023)
## [1] 41 17
head(data2023, 10)
## # A tibble: 10 × 17
##    ...1  ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10 ...11 ...12 ...13
##    <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 Regi… Loca… Resi… <NA>  <NA>  <NA>  <NA>  <NA>  Solar <NA>   <NA> <NA>  <NA> 
##  2 <NA>  <NA>  Dail… MWh   <NA>  <NA>  Cust… <NA>  Numb… <NA>  "Gen… <NA>  Ener…
##  3 <NA>  <NA>  <NA>  Gene… Off … Total Off … Total Res   Non-… "Res… Non-… <NA> 
##  4 Sydn… BAYS… 12.6… 2882… 2733… 3155… 14792 68610 4381  279   "211… 6862… 1605…
##  5 <NA>  BURW… 12.4… 6441… 3376… 6779… 2101  14861 995   52    "469… 988.… 3398…
##  6 <NA>  CANA… 13.5… 1830… 7881… 1909… 4765  38740 2490  117   "131… 4284… 9463…
##  7 <NA>  CANT… 15.0… 6567… 7574… 7325… 35055 1334… 14679 823   "696… 2682… 5498…
##  8 <NA>  CUMB… 12.8… 1012… 8201… 1094… 3853  23410 2550  179   "115… 6067… 9777…
##  9 <NA>  GEOR… 14.6… 2772… 3853… 3157… 18522 59056 6046  234   "307… 6387… 2171…
## 10 <NA>  HORN… 18.4… 3095… 4985… 3594… 20435 53315 9222  222   "495… 6107… 3195…
## # … with 4 more variables: ...14 <chr>, ...15 <chr>, ...16 <chr>, ...17 <chr>

0.1 Data source verification

This data source is from Ausgrid, which is a large electricity provider that has more than 1.7 million customers in Sydney, Australia.Although this is a electricty provider that is run by a company, however, it must be subjected to a range of regulations and guidelines to ensure the accuracy of their data, including the Australian Energy Regulator(AER), according to the Australian energy policies.

Some useful functions

wrap.it <- function(x, len)
{ 
  sapply(x, function(y) paste(strwrap(y, len), 
                              collapse = "\n"), 
         USE.NAMES = FALSE)
}

wrap.labels <- function(x, len)
{
  if (is.list(x))
  {
    lapply(x, wrap.it, len)
  } else {
    wrap.it(x, len)
  }
}

0.2 IDA

To get an idea of the new data, we will do a quick bargraph of the ‘general supply’ columns from the excel file. (It is with several trial and error, that we have found out the original excel file is not suitable for R data analysis, therefore manual excel operation was done to extract it out into a form which was useful.)

require(readxl)
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
options(scipen = 999)

# Read in files
extracted_data <- read_excel("data/Electricity_consumption(extracted).xlsx")
extracted_data <- arrange(extracted_data)

# Setting up the output graph
par(mar = c(5, 8, 0, 2), mgp = c(4, 1, 0))

# The 2019 Dataset
values2019 <- extracted_data$`2019-R`
suburbs2019 <- extracted_data$H2019
barplot(values2019, names.arg=wrap.labels(rev(suburbs2019), 20), 
        col="red", xlab=" Electricity consumption/MWh", horiz=T, las=2, cex.names = 0.7)

# The 2020 Dataset
values2020 <- extracted_data$`2020-R`
suburbs2020 <- extracted_data$H2020
barplot(values2020, names.arg=wrap.labels(rev(suburbs2020), 20), 
        col="green", xlab=" Electricity consumption/MWh", horiz=T, las=2, cex.names = 0.7)

# The 2021 Dataset
values2021 <- extracted_data$`2021-R`
suburbs2021 <- extracted_data$H2021
barplot(values2021, names.arg=wrap.labels(rev(suburbs2021), 20), 
        col="blue", xlab=" Electricity consumption/MWh", horiz=T, las=2, cex.names = 0.7)

## Research question 1:

0.2.1 For ‘General Supply’

lin_reg <- function(l1, l2){
  data <- data.frame(x=l1, y=l2)
  model <- lm(y ~ x, data=data)
  slope <- summary(model)$coefficients[2,1]
  
  plot(data$x, data$y, xlab="Customer number", ylab="Electricity consumption/MWh")
  abline(model)
}
# Linear regression for 2019
lin2019G <- lin_reg(extracted_data$`2019-RP`, values2019)

# Linear regression for 2021
lin2021G <- lin_reg(extracted_data$`2021-RP`, values2021)

For ‘Non-residential medium sites’

values2019M <- extracted_data$`2019-M`
lin2019M <- lin_reg(extracted_data$`2019-MP`,values2019M)

values2021M <- extracted_data$`2021-M`
lin2021M <- lin_reg(extracted_data$`2021-MP`,values2021M)

For ‘Non-residential large sites’

values2019L <- extracted_data$`2019-L`
lin2019L <- lin_reg(extracted_data$`2019-LP`,values2019L)

values2021L <- extracted_data$`2021-L`
lin2021L <- lin_reg(extracted_data$`2021-LP`,values2021L)

As can see, the linear regression of all three graphs have insignifcant differences. One possible reason for this is that during the pandemic in 2020, we observed several lockdowns where people were forced to stay home. Businesses began to shut down and jobs transitioned from offices to working from home. This may have played a major role in the electrical usage from 2019-2020 as usage in commercial areas decreased while residential areas increased.The effects of both of these population movement balanced each other out, and thus the relationship between the population and electricity consumption did not change.

0.3 Research question 2:

How does the electrical consumption vary for different usages such as the general supply, small-medium sites and large sites. To answer this question, bar plots and normal distribution graphs will be used to visualise data. The general supply refers to electricity used for residential purposes. The small-medium sites refer to non-residential customers utilising less than 160MWh per year while large sites use more than 160MWh per year.
To get an general idea of which areas had the highest amount of electricity consumption, normal distribution graph was made. (We’ll use the latest complete dataset, which was the 2021 dataset) General Supply

distribution <- function(title, name, values){
  df <- data.frame(label=name, value=values)
  df_sorted <- df[order(df$value, decreasing = FALSE),]
  mean_val <- mean(values)
  sd_val <- sd(values)
  par(mar = c(5, 6, 2, 2), mgp = c(4, 1, 0), fig = c(0, 1, 0, 1), fin = c(6, 4))
  barplot(df_sorted$value, names.arg=wrap.labels(df_sorted$label, 20), 
        col="blue", xlab=title, horiz=T, las=2, cex.names = 0.4, , cex.lab=0.7, cex.axis=0.7)
  hist(values, breaks=10, freq=FALSE, col="grey", xlab="Values", main="Normal Distribution")
  curve(dnorm(x, mean=mean_val, sd=sd_val), add=TRUE, col="blue")
  abline(v = mean_val, col = "red", lwd = 2)
}
distribution("Electricity Consumption/MWh", suburbs2021, values2021)

In the general supply bar graph, Sydney utilises the highest amount of electricity compared to the other suburbs. This can be explained as it has the highest population density meaning more residents. The normal distribution curve is skewed to the right with the mean consumption being just over 200000 MWh.

‘Non-residential small-medium sites’

distribution("Electricity Consumption/MWh",suburbs2021, values2021M)

In the small-medium site graph, Sydney is also the highest consuming suburb which is explained by the high number of businesses that operate within the area.

‘Non-residential large sites’

distribution("Electricity Consumption/MWh",suburbs2021, values2021L)

Finally, to make a conclusion, a pie chart is made

par(mar = c(5, 6, 2, 2), mgp = c(4, 1, 0), fig = c(0, 1, 0, 1), fin = c(6, 4))
sum_2021G <- sum(values2021)
sum_2021M <- sum(values2021M)
sum_2021L <- sum(values2021L)
pie_values <- c(sum_2021G, sum_2021M, sum_2021L)
pie_labels <- c("General Supply", "Medium Sites", "Large Sites")
pie(pie_values, labels = pie_labels)

As can see, this shows a nuanced side of the distribution of electricity consumption by Ausgrid, as Ausgrid is the largest distributor of electricity on the east coast, there’s a high chance that this represents the total electricity consumption of NSW. If we’re talking about large areas like shopping malls, business skyscrapers or tourist attractions, than population does matter, in fact a lot, so places where there tends to be large flow of population, like Sydney(The normal distribution graph shows how skewed this situation is), actually uses the most amount electrcity. However, if we’re talking about residential supplies, which corresponds to the ‘General Supply’ column of the data source.Than population matters less, because with residential usage, the amount of electricity consumption depends on the size of the housing, which is why I guess Central Coast has the most amount of electricity (Note: Again, just to be cautious, it is considered that Ausgrid could’ve had a target customer base, which could explain part of this situation as well)

0.4 References

jrara. (17 C.E., November). plot - R barplot: wrapping long text labels? Stack Overflow. https://stackoverflow.com/questions/20241065/r-barplot-wrapping-long-text-labels